multimodal knowledge
GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
Multimodal Machine Translation (MMT) has demonstrated that visual information can significantly aid machine translation. However, existing MMT methods struggle to bridge the modality gap: they enforce rigid visual-linguistic alignment and remain confined to inference within the multimodal domains they were trained on. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information, and we introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K English-to-French and English-to-German tasks demonstrate that GIIFT surpasses existing approaches and achieves state-of-the-art performance, even without images at inference. Results on the WMT benchmark show significant improvements over image-free translation baselines, demonstrating the strength of GIIFT for inductive image-free inference.
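The adapter at the heart of this pipeline is a graph attention layer operating over fused visual and textual scene-graph nodes. Below is a minimal single-head sketch of such a cross-modal GAT adapter in PyTorch; the class name, residual update, and adjacency convention are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGATLayer(nn.Module):
    """Single-head graph attention over a fused multimodal scene graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; visual and textual scene-graph nodes
        # share one graph. adj: (N, N) adjacency with self-loops.
        h = self.proj(x)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)          # h_i for every pair (i, j)
        hj = h.unsqueeze(0).expand(n, n, -1)          # h_j for every pair (i, j)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))    # attend only along edges
        alpha = torch.softmax(e, dim=-1)              # (N, N) attention weights
        return x + alpha @ h                          # residual adapter update
```

Because attention is masked to graph edges, adding self-loops (for example via `adj.fill_diagonal_(1)`) keeps every softmax row finite.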
LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting
Lingzheng Zhang, Lifeng Shen, Yimin Zheng, Shiyuan Piao, Ziyue Li, Fugee Tsung
Recent research has shown that large language models (LLMs) can be effective for real-world time series forecasting thanks to their strong natural language understanding. However, aligning time series with the semantic space of an LLM incurs high computational cost and inference complexity, particularly for long-range generation. Building on recent advances in linear models for time series, this paper introduces LeMoLE, an LLM-enhanced mixture of linear experts for precise and efficient forecasting. The approach combines a mixture of linear experts with multiple lookback lengths and a new multimodal fusion mechanism. The mixture of linear experts is efficient because of its simplicity, while the fusion mechanism adaptively combines the experts based on text features from a pre-trained LLM. In experiments, we reexamine the assumption, made by existing time-series LLMs, that time series must be aligned into the LLM's semantic space, and we further analyze their efficiency and effectiveness in forecasting. Our results show that LeMoLE achieves lower prediction errors and higher computational efficiency than existing LLM-based models.
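Concretely, the described design amounts to one linear forecaster per lookback length plus a gate driven by frozen-LLM text features. A hedged PyTorch sketch under those assumptions (layer names and shapes are invented for illustration):

```python
import torch
import torch.nn as nn

class MixtureOfLinearExperts(nn.Module):
    """Linear experts over several lookback lengths, gated by LLM text features."""

    def __init__(self, lookbacks, horizon, text_dim):
        super().__init__()
        self.lookbacks = lookbacks                        # e.g. [96, 192, 336]
        self.experts = nn.ModuleList(
            nn.Linear(L, horizon) for L in lookbacks)     # one linear map each
        self.gate = nn.Linear(text_dim, len(lookbacks))   # text-conditioned gate

    def forward(self, series, text_emb):
        # series: (B, T) with T >= max(lookbacks)
        # text_emb: (B, text_dim), e.g. a frozen LLM's encoding of a prompt
        preds = torch.stack(
            [expert(series[:, -L:])
             for expert, L in zip(self.experts, self.lookbacks)],
            dim=1)                                        # (B, n_experts, horizon)
        weights = torch.softmax(self.gate(text_emb), dim=-1)
        return (weights.unsqueeze(-1) * preds).sum(dim=1) # (B, horizon)
```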
A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, Ronald Cheong Kin Chan, Hao Chen
Remarkable strides in computational pathology (CPath) have been made with task-agnostic foundation models (FMs) that advance performance across a wide array of downstream clinical tasks. Despite this promising performance, several challenges remain. First, prior works have relied on either vision-only or vision-caption data, disregarding pathology reports and gene expression profiles, which offer distinct knowledge for versatile clinical applications. Second, current pathology FMs concentrate predominantly on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset of H&E diagnostic whole-slide images and their associated pathology reports and RNA-Seq data, totaling 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, Multimodal Self-TAught PRetraining (mSTAR), which injects multimodal knowledge at the whole-slide context into the pathology FM. The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the FM to acquire whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modeling context from unimodal to multimodal knowledge and from patch level to slide level. To systematically evaluate mSTAR, we conduct extensive experiments, including slide-level unimodal and multimodal applications, across 7 diverse task types and 43 subtasks, the largest spectrum of downstream tasks to date. Across these slide-level applications, mSTAR consistently delivers significant performance gains over SOTA FMs.
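The abstract does not spell out the pretraining objective, but slide-level modality pairs suggest a contrastive alignment between a whole-slide embedding and its paired report or RNA-Seq embedding. A speculative InfoNCE-style sketch, purely to make the slide-level pairing concrete (not the published mSTAR loss):

```python
import torch
import torch.nn.functional as F

def slide_pair_contrastive(slide_emb, pair_emb, temperature=0.07):
    """Symmetric InfoNCE over slide-level modality pairs.

    slide_emb: (B, d) whole-slide embeddings; pair_emb: (B, d) embeddings of
    the paired modality (pathology report or RNA-Seq profile).
    """
    s = F.normalize(slide_emb, dim=-1)
    p = F.normalize(pair_emb, dim=-1)
    logits = s @ p.t() / temperature                      # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # slide -> pair
                  F.cross_entropy(logits.t(), targets))   # pair -> slide
```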
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Baizhou Huang, Xu Zhang, Xinyu Hu, Xiaojun Wan
Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge, which can manifest as misreading or misrecognition errors due to the complexity of multimodal knowledge. Previous benchmarks have not systematically analyzed how well editing methods correct these two error types. To better represent and correct them, we decompose multimodal knowledge into its visual and textual components: each error type corresponds to a different editing format, which edits a distinct part of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. The benchmark facilitates independent correction of misreading and misrecognition errors by editing the corresponding knowledge component. We evaluate three multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly with respect to modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research into effective techniques for this task.
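One way to make the decomposition concrete: represent each multimodal fact as a visual component (image → entity) plus a textual component ((entity, relation) → object), and route each error type to the component it should edit. The structure below is a hypothetical illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MultimodalFact:
    """A multimodal fact split into a visual and a textual component.

    Visual component: image -> entity (misrecognition edits target this).
    Textual component: (entity, relation) -> object (misreading edits target this).
    """
    image_id: str   # reference to the input image
    entity: str     # what the image depicts, e.g. "Eiffel Tower"
    relation: str   # e.g. "located in"
    obj: str        # e.g. "Paris"

def editing_format(error_type: str) -> str:
    """Route an observed error to the knowledge component to be edited."""
    if error_type == "misrecognition":   # image mapped to the wrong entity
        return "visual: image -> entity"
    if error_type == "misreading":       # entity linked to an outdated object
        return "textual: (entity, relation) -> object"
    raise ValueError(f"unknown error type: {error_type!r}")
```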
Distilling Implicit Multimodal Knowledge into LLMs for Zero-Resource Dialogue Generation
Bo Zhang, Hui Ma, Jian Ding, Jian Wang, Bo Xu, Hongfei Lin
Integrating multimodal knowledge into large language models (LLMs) represents a significant advance in dialogue generation. However, effectively incorporating such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), which enhances LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, which uses an Implicit Query Transformer to extract visual implicit knowledge from image-text pairs and encode it into knowledge vectors; and knowledge integration, which employs a novel Bidirectional Variational Information Fusion technique to seamlessly integrate the distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also show a deep understanding of context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Extensive experiments on two dialogue datasets show that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be publicly available following acceptance.
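The Implicit Query Transformer resembles a Q-Former-style module: a fixed set of learnable queries cross-attends to frozen image features and emits fixed-size knowledge vectors. A minimal sketch under that reading (dimensions, layer layout, and names are assumptions):

```python
import torch
import torch.nn as nn

class ImplicitQueryTransformer(nn.Module):
    """Learnable queries distill visual implicit knowledge into fixed-size vectors."""

    def __init__(self, n_queries=32, dim=768, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, n_patches, dim) from a frozen vision encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return q   # (B, n_queries, dim) knowledge vectors to inject into the LLM
```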
Multimodal Machine Unlearning
Machine Unlearning is the process of removing specific training data samples and their corresponding effects from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without complete retraining. Unlearning in a multimodal setting presents unique challenges due to the intrinsic dependencies between data modalities and the expensive cost of training on large multimodal datasets and architectures. Current approaches to machine unlearning have not fully addressed these challenges. To bridge this gap, we introduce MMUL, a machine unlearning approach designed specifically for multimodal data and models. MMUL formulates the multimodal unlearning task around three key properties: (a) modality decoupling, which decouples the association between individual unimodal data points within multimodal inputs marked for deletion, rendering them unrelated data points within the model's context; (b) unimodal knowledge retention, which preserves the model's unimodal representation capability post-unlearning; and (c) multimodal knowledge retention, which preserves the model's multimodal representation capability post-unlearning. MMUL is efficient to train and does not require a strongly convex loss. Experiments on two multimodal models and four multimodal benchmark datasets, including vision-language and graph-language datasets, show that MMUL outperforms existing baselines, gaining an average improvement of +17.6 points over the best-performing unimodal baseline in distinguishing deleted from remaining data. In addition, MMUL largely maintains the pre-existing knowledge of the original model post-unlearning, with a performance gap of only 0.3 points compared to retraining a new model from scratch.
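A plausible way to express the three properties is a three-term objective: a repulsion term on deleted pairs plus two distillation terms against the frozen pre-unlearning model. The sketch below assumes a hypothetical model interface with `encode_image`, `encode_text`, and `fuse` methods; it illustrates the property decomposition, not MMUL's actual loss:

```python
import torch
import torch.nn.functional as F

def mmul_style_loss(model, frozen_ref,
                    forget_img, forget_txt, retain_img, retain_txt,
                    lam_uni=1.0, lam_multi=1.0):
    """Three-term sketch of multimodal unlearning.

    Assumed interface: encode_image / encode_text return (B, d) embeddings
    and fuse returns a (B, d) multimodal representation.
    """
    # (a) modality decoupling: make deleted image-text pairs look unrelated
    zi = model.encode_image(forget_img)
    zt = model.encode_text(forget_txt)
    decouple = F.cosine_similarity(zi, zt, dim=-1).mean()

    with torch.no_grad():                     # targets from the original model
        ref_i = frozen_ref.encode_image(retain_img)
        ref_t = frozen_ref.encode_text(retain_txt)
        ref_m = frozen_ref.fuse(retain_img, retain_txt)

    # (b) unimodal knowledge retention
    uni = (F.mse_loss(model.encode_image(retain_img), ref_i) +
           F.mse_loss(model.encode_text(retain_txt), ref_t))

    # (c) multimodal knowledge retention
    multi = F.mse_loss(model.fuse(retain_img, retain_txt), ref_m)

    return decouple + lam_uni * uni + lam_multi * multi
```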
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec
Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we interconnect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
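The key structural idea, a QA-context super node bridging the scene graph and the concept graph, can be sketched as one round of message passing over the joint graph. The node indexing, mean aggregation, and GRU update below are illustrative choices, not the paper's exact GNN:

```python
import torch
import torch.nn as nn

class SuperNodeMessagePassing(nn.Module):
    """One round of message passing over the joint graph, where a QA-context
    super node (index 0) bridges scene-graph and concept-graph nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # message transform
        self.upd = nn.GRUCell(dim, dim)  # state update

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node states; adj: (N, N) adjacency in which row/column 0
        # (the super node) connects to every scene node and concept node.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        messages = (adj @ self.msg(x)) / deg   # mean over neighbors
        return self.upd(messages, x)           # GRU-style node update
```

Because the super node is adjacent to both subgraphs, repeated rounds let information flow scene → QA context → concept and back, which is what makes the fusion bidirectional.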
Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation
Likang Wu, Zhi Li, Hongke Zhao, Zhefeng Wang, Qi Liu, Baoxing Huai, Nicholas Jing Yuan, Enhong Chen
Zero-Shot Learning (ZSL), which aims to recognize unseen objects automatically, is a promising paradigm for machines to continuously acquire new real-world knowledge. Recently, the Knowledge Graph (KG) has proven an effective scheme for handling zero-shot tasks with large-scale, non-attribute data. Prior studies typically embed relationships between seen and unseen objects into visual information using existing knowledge graphs to promote cognition of the unseen data. In reality, however, real-world knowledge is naturally multimodal. Compared with ordinary structural knowledge from a graph perspective, a multimodal KG can provide cognitive systems with fine-grained knowledge: text descriptions and visual content can depict more critical details of a fact than knowledge triplets alone. Unfortunately, this multimodal fine-grained knowledge remains largely unexploited because of the bottleneck of feature alignment between modalities. To that end, we propose a multimodal intensive ZSL framework that matches image regions with the corresponding semantic embeddings via a dense attention module and a self-calibration loss. This makes the semantic transfer process of our ZSL framework learn more differentiated knowledge between entities, and frees the model from the performance limitation of relying only on coarse global features. We conduct extensive experiments on large-scale real-world data, and the results clearly demonstrate the effectiveness of the proposed model on standard zero-shot classification tasks.
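The dense attention module can be read as scoring every image region against every class's semantic embedding and pooling with attention, so that local details rather than one global feature drive the class score. A hedged sketch of that region-to-semantics matching (the self-calibration loss is omitted, and all shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def dense_region_attention(region_feats, sem_emb):
    """Match image regions to class semantic embeddings with dense attention.

    region_feats: (B, R, d) local region features; sem_emb: (C, d) semantic
    embeddings of seen + unseen classes derived from a knowledge graph.
    Returns per-class compatibility scores of shape (B, C).
    """
    r = F.normalize(region_feats, dim=-1)
    s = F.normalize(sem_emb, dim=-1)
    sim = torch.einsum("brd,cd->brc", r, s)   # region-class similarities
    attn = torch.softmax(sim, dim=1)          # which regions matter per class
    return (attn * sim).sum(dim=1)            # attention-weighted class scores
```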
Combo of Thinking and Observing for Outside-Knowledge VQA
Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, Weiping Wang
Outside-knowledge visual question answering is a challenging task that requires both acquiring and using open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space, overlooking the much vaster textual knowledge available in natural-language space; others convert the image to text and fuse it with textual knowledge entirely in natural-language space, abandoning visual features altogether. In this paper, we instead constrain the cross-modality space to coincide with the natural-language space, so that visual features are preserved directly while the model still benefits from the vast knowledge in natural-language space. To this end, we propose a novel framework consisting of a multimodal encoder, a textual encoder, and an answer decoder. This structure allows us to introduce more types of knowledge, including explicit and implicit multimodal and textual knowledge. Extensive experiments validate the superiority of the proposed method, which outperforms the state of the art by 6.17% accuracy. We also conduct comprehensive ablations of each component and systematically study the roles of the various types of knowledge. Code and knowledge data can be found at https://github.com/PhoebusSi/Thinking-while-Observing.
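Architecturally, the answer decoder can attend over the concatenated states of both encoders, so that visual features ("observing") and textual knowledge ("thinking") live in one language space. A minimal sketch of that three-part layout; module names and dimensions are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class ThinkObserveFusion(nn.Module):
    """Answer decoder attending jointly over a multimodal encoder (question +
    visual features) and a textual encoder (question + retrieved knowledge),
    both projected into one shared language space."""

    def __init__(self, dim=768, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, answer_tokens, mm_states, txt_states):
        # answer_tokens: (B, La, dim) embedded answer prefix
        # mm_states: (B, Lm, dim) from the multimodal encoder ("observing")
        # txt_states: (B, Lt, dim) from the textual encoder ("thinking")
        memory = torch.cat([mm_states, txt_states], dim=1)  # shared space
        return self.decoder(answer_tokens, memory)          # (B, La, dim)
```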